 letter frequency


ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers

Shen, Jeff, Smith, Lindsay M.

arXiv.org Artificial Intelligence

To enhance interpretability, we introduce a novel bijective decoding head that explicitly models permutations via the Gumbel-Sinkhorn method, enabling direct extraction of learned cipher mappings. Our architectural innovations and analysis methods are applicable beyond cryptograms and offer new insights into neural network generalization and interpretability. A cryptogram is a type of puzzle in which text is encrypted using a substitution cipher, and the user's task is to recover the original plaintext by inferring the cipher used for the encryption. Users typically solve cryptograms using prior knowledge of a language's letter frequency distributions and common words. Originally developed for real encryption purposes, cryptograms are now popular in newspapers and puzzle books as entertainment due to their simplicity. This simplicity, however, provides a unique testbed for studying generalization and reasoning in neural networks. In a one-to-one monoalphabetic substitution cipher, each letter in a fixed alphabet is mapped to a unique substitute character; this cipher represents a bijective mapping over the alphabet. While other ciphers exist (e.g., Vigenère cipher, Playfair cipher), we focus here on one-to-one monoalphabetic substitution ciphers, as the problem space is extremely large but remains structurally simple to interpret. We hereafter mean one-to-one monoalphabetic substitution cipher when we say "cipher", unless otherwise specified. More formally, let Σ be a finite alphabet of size V representing allowable characters (e.g., 26 for the English alphabet).
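As a rough illustration of the cipher definition in the abstract above (not of the ALICE architecture itself), the bijective mapping over the alphabet can be sketched in Python as a random permutation of the 26 lowercase English letters. The function names `make_cipher`, `apply_mapping`, and `invert` are hypothetical helpers introduced here, and the choice to leave non-alphabet characters unchanged is an assumption.

```python
import random
import string

def make_cipher(alphabet=string.ascii_lowercase, seed=None):
    """Return a random one-to-one mapping (a permutation) over `alphabet`."""
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))

def apply_mapping(text, mapping):
    """Substitute each character; characters outside the alphabet pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)

def invert(mapping):
    """Invert the mapping; this is well defined only because it is bijective."""
    return {cipher_ch: plain_ch for plain_ch, cipher_ch in mapping.items()}

key = make_cipher(seed=0)
plaintext = "a cryptogram is a puzzle"
ciphertext = apply_mapping(plaintext, key)
recovered = apply_mapping(ciphertext, invert(key))
```

Because the mapping is a permutation, inverting the dictionary recovers the plaintext exactly, and the number of possible keys for a 26-letter alphabet is 26!, which is why the problem space is described as extremely large.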


M-IFEval: Multilingual Instruction-Following Evaluation

Dussolle, Antoine, Díaz, Andrea Cardeña, Sato, Shota, Devine, Peter

arXiv.org Artificial Intelligence

Instruction following is a core capability of modern large language models (LLMs), making evaluating this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.


An Analysis of Letter Dynamics in the English Alphabet

Zhao, Neil, Zheng, Diana

arXiv.org Artificial Intelligence

The tabulation of commonly used letters, as determined by letter frequency, was later utilized to improve typewriter keyboard arrangement by minimizing hand motion [5]. Statistical characteristics of different letters of the English alphabet were further studied in the context of different sentence structures [6]. The letters 'B', 'S', 'M', 'H', 'C' were found to occur most frequently as the initial letters of proper nouns, while 'E', 'A', 'R', 'N' were the most frequently used letters when the entire proper noun is considered. For entire text documents, the most commonly used letters were found to be 'E', 'T', 'A', 'O', 'N'. Interestingly, 95% of the English vocabulary was found to be represented by 13 letters of the alphabet. Our manuscript expanded upon the statistical study of the English alphabet by evaluating letter frequency in the context of different categories of writings. We analyzed news articles, novels, plays, and scientific articles for letter frequency and distribution. As a result, we determined the information density of the letters of the alphabet. Additionally, we developed a metric called "distance, d" to act as a simple algorithm for recognizing writing category.


How to get the Letter Frequency in Python

#artificialintelligence

We will provide you with a walk-through example of how you can easily get the letter frequency in documents by considering the whole document or the unique words. Finally, we will compare our observed relative frequencies with the letter frequency of the English language. From the above horizontal barplot, we can easily see that the letter e is the most common in both English Texts and Dictionaries. Notice also that the distribution differs between Texts and Dictionaries. We will work with the Moby Dick book and we will provide the frequency and the relative frequency of the letters.
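The two counting modes mentioned in this excerpt (whole document versus unique words) might look roughly like the sketch below. It is not the article's own code: the helper names are hypothetical, the short `sample` string merely stands in for the full Moby Dick text, and restricting counts to the letters a-z is an assumption.

```python
import re
from collections import Counter

def letter_counts(text):
    """Count each letter a-z over the whole text, ignoring case."""
    return Counter(ch for ch in text.lower() if "a" <= ch <= "z")

def unique_word_letter_counts(text):
    """Count letters over the set of distinct words, so each word counts once."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return Counter(ch for word in words for ch in word)

def relative_frequency(counts):
    """Convert raw counts to proportions that sum to 1 (empty input gives {})."""
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

# Stand-in sample; in practice this would be the full text of the book.
sample = "Call me Ishmael. Some years ago, never mind how long precisely."
document_freq = relative_frequency(letter_counts(sample))
dictionary_freq = relative_frequency(unique_word_letter_counts(sample))
```

Counting over unique words removes the weight that very common words such as "the" give to their letters, which is one reason the distribution over texts can differ from the distribution over dictionary entries.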


Cryptology from the crypt: How I cracked a 70-year-old coded message from beyond the grave

#artificialintelligence

In recent weeks I managed to decrypt a difficult cipher that, despite expert codebreakers' best efforts, had remained unsolved for 70 years. The code was created by the late Cambridge professor and scientist Robert Henry Thouless, who passed away in 1984. He created it as a "test of survival" to see if he could communicate with the living after his death. Thouless thought if he successfully transmitted cipher keywords to the living through spiritual mediums and the message was received, this would prove he had survived his death. In 2019, I was more interested in seeing whether computer speed, storage and networking capabilities had advanced enough to break a code that had outlived its maker.